Project Title: Machine Learning Model Development for Weather Forecasting Using Global Historical Climatology Network Daily data.

Objective

The aim of this project is to analyze Global Historical Climatology Network daily (GHCNd) data to find the correlation among different feaures of a given location and use machine learning techniques to forecast a desired feature or more with some existing features as input for the learning problem.

The whole notebook will cover the following sections:

# Topic
01 Data Extraction
02 Data Analysis & Visualization
03 Data Processing
04 ML Model Development & Learning
05 Model Statistics
06 Conclusion

I used the following Python Libraries for analysing the LTI systems:

No Library No Library
01 Pandas 02 Xarray
03 Numpy 04 Metpy
05 Matplotlib 06 Cartopy
07 Matplotlib 3D Axes Toolkit 08 Scikit-Learn
09 Datetime 10 Plotly Express
11 Seaborn 12 Matplotlib Pyplot
13 SciPy Stats 14 Decision Tree Regressor
15 KNeigbors Regressor 16 Linear Regression
17 Multi Output Regressor 18 Cartopy Features
19 plotly Figure Factory 20 Warnings

Let us import all these necessary libraries:

In [1]:
import pandas                   as pd
import numpy                    as np
import matplotlib               as mpl
import matplotlib.pyplot        as plt
from   mpl_toolkits.mplot3d     import Axes3D
import datetime                 as dt
import seaborn                  as sns
import xarray                   as xr
import metpy                    as metpy
import cartopy.crs              as ccrs
import cartopy.feature          as cfeature
import plotly.express           as px
import plotly.figure_factory    as ff
from   scipy.stats              import pearsonr
import warnings                 as warnings

warnings.filterwarnings("ignore", category=UserWarning)
C:\Users\101114993\AppData\Local\anaconda3\Lib\site-packages\paramiko\transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated and will be removed in a future release
  "class": algorithms.Blowfish,

Data Extraction

For this project we will use atmospheric dataset for QUILLAYUTE AIRPORT, WA US. The dataset is available at National Centers for Environmental Information. Details of the station information and data record period is given below:

Attributes Description
Name QUILLAYUTE AIRPORT, WA US
Network ID GHCND:USW00094240
Latitude/Longitude 47.93695°, -124.55757°
Elevation 56.4 m

Record Period

Attributes Description
Start Date 1966-08-01
End Date 2025-05-03
Data Coverage 100%

The data is stored in NetCDF (Network Common Data Form) file format, which is used to store and share scientific data, particularly in fields like atmospheric and oceanographic research. To open the data, we need Xarray library. We accesse this data from SD Mines THREDDS Data Server.

In [2]:
##########################################################
#
# Touching our netCDF File
#

ghcnd_file = "http://kyrill.ias.sdsmt.edu:8080/thredds/dodsC/GHCNd/USA/WA/GHCND-USW00094240__QUILLAYUTE_AIRPORT_WA_US.nc"

ds = xr.open_dataset(filename_or_obj  = ghcnd_file,
                     decode_timedelta = True)
del ds["time_bounds"]

display(ds)
#
##########################################################
<xarray.Dataset>
Dimensions:                               (time: 21421)
Coordinates:
  * time                                  (time) datetime64[ns] 1966-08-01T12...
    lat                                   float64 ...
    lon                                   float64 ...
    alt                                   float64 ...
    station_id                            |S64 ...
Data variables: (12/41)
    precipitation_amount                  (time) float32 ...
    thickness_of_snowfall_amount          (time) float32 ...
    surface_snow_thickness                (time) float32 ...
    maximum_air_temperature               (time) float32 ...
    minimum_air_temperature               (time) float32 ...
    dew_point_temperature                 (time) float32 ...
    ...                                    ...
    weather_event_freezing_rain           (time) float32 ...
    weather_event_snow                    (time) float32 ...
    weather_event_unknown_precip          (time) float32 ...
    weather_event_ground_fog              (time) float32 ...
    weather_event_ice_fog                 (time) float32 ...
    station_name                          |S64 ...
Attributes:
    title:                           Global Historical Climatology Network da...
    institution:                     NOAA Center for Environmental Prediction...
    references:                      https://doi.org/10.1175/JTECH-D-11-00103.1
    url:                             https://www.ncei.noaa.gov/products/land-...
    GHCN_Station_Code:               GHCND:USW00094240
    Station_Name:                    QUILLAYUTE AIRPORT, WA US
    Station_Latitude:                47.93695
    Station_Longitude:               -124.55757
    Station_Elevation_in_Meters:     56.4
    Conventions:                     CF-1.12
    featureType:                     timeSeries
    DODS_EXTRA.Unlimited_Dimension:  time
xarray.Dataset
    • time: 21421
    • time
      (time)
      datetime64[ns]
      1966-08-01T12:00:00 ... 2025-03-...
      bounds :
      time_bounds
      _ChunkSizes :
      1339
      array(['1966-08-01T12:00:00.000000000', '1966-08-02T12:00:00.000000000',
             '1966-08-03T12:00:00.000000000', ..., '2025-03-22T12:00:00.000000000',
             '2025-03-23T12:00:00.000000000', '2025-03-24T12:00:00.000000000'],
            dtype='datetime64[ns]')
    • lat
      ()
      float64
      ...
      long_name :
      latitude
      description :
      latitude
      standard_name :
      latitude
      units :
      degrees_north
      [1 values with dtype=float64]
    • lon
      ()
      float64
      ...
      long_name :
      longitude
      description :
      longitude
      standard_name :
      longitude
      units :
      degrees_east
      [1 values with dtype=float64]
    • alt
      ()
      float64
      ...
      long_name :
      station elevation
      description :
      station elevation
      standard_name :
      height
      positive :
      up
      axis :
      Z
      units :
      m
      [1 values with dtype=float64]
    • station_id
      ()
      |S64
      ...
      long_name :
      station id
      description :
      station id
      cf_role :
      timeseries_id
      [1 values with dtype=|S64]
    • precipitation_amount
      (time)
      float32
      ...
      long_name :
      Daily Total Precipitation
      description :
      Daily Total Precipitation
      cell_methods :
      time:sum
      standard_name :
      precipitation_amount
      units :
      kg m-2
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • thickness_of_snowfall_amount
      (time)
      float32
      ...
      long_name :
      Daily Total Snowfall
      description :
      Daily Total Snowfall
      cell_methods :
      time:sum
      standard_name :
      thickness_of_snowfall_amount
      units :
      m
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • surface_snow_thickness
      (time)
      float32
      ...
      long_name :
      Snow Depth on Surface
      description :
      Snow Depth on Surface
      cell_methods :
      time:point
      standard_name :
      surface_snow_thickness
      units :
      m
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • maximum_air_temperature
      (time)
      float32
      ...
      long_name :
      2-m Maximum Daily Air Temperature
      description :
      2-m Maximum Daily Air Temperature
      cell_methods :
      time:maximum
      standard_name :
      air_temperature
      units :
      degC
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • minimum_air_temperature
      (time)
      float32
      ...
      long_name :
      2-m Minimium Daily Air Temperature
      description :
      2-m Minimium Daily Air Temperature
      cell_methods :
      time:minimum
      standard_name :
      air_temperature
      units :
      degC
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • dew_point_temperature
      (time)
      float32
      ...
      long_name :
      2-m Average Daily Dew Point Temperature
      description :
      2-m Average Daily Dew Point Temperature
      cell_methods :
      time:mean
      standard_name :
      dew_point_temperature
      units :
      degC
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • air_pressure_at_sea_level
      (time)
      float32
      ...
      long_name :
      Mean Daily Pressure Reduced to Mean Sea Level
      description :
      Mean Daily Pressure Reduced to Mean Sea Level
      cell_methods :
      time:mean
      standard_name :
      air_pressure_at_sea_level
      units :
      degC
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • surface_air_pressure
      (time)
      float32
      ...
      long_name :
      Mean Daily Surface Pressure
      description :
      Mean Daily Surface Pressure
      cell_methods :
      time:mean
      standard_name :
      surface_air_pressure
      units :
      degC
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • wet_bulb_temperature
      (time)
      float32
      ...
      long_name :
      2-m Average Daily Wet Bulb Temperature
      description :
      2-m Average Daily Wet Bulb Temperature
      cell_methods :
      time:mean
      standard_name :
      wet_bulb_temperature
      units :
      degC
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • mean_wind_speed
      (time)
      float32
      ...
      long_name :
      Mean Daily Wind Speed
      description :
      Mean Daily Wind Speed
      cell_methods :
      time:mean
      standard_name :
      mean_wind_speed
      units :
      m s-1
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • percent_of_possible_sunshine
      (time)
      float32
      ...
      long_name :
      Percent of Potential Sunshine
      description :
      Percent of Potential Sunshine
      cell_methods :
      time:sum
      units :
      percent
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • mean_relative_humidity
      (time)
      float32
      ...
      long_name :
      2-m Average Daily Relative Humidity
      description :
      2-m Average Daily Relative Humidity
      cell_methods :
      time:mean
      standard_name :
      relative_humidity
      units :
      percent
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • min_relative_humidity
      (time)
      float32
      ...
      long_name :
      2-m Minimum Daily Relative Humidity
      description :
      2-m Minimum Daily Relative Humidity
      cell_methods :
      time:minimum
      standard_name :
      relative_humidity
      units :
      percent
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • max_relative_humidity
      (time)
      float32
      ...
      long_name :
      2-m Maximum Daily Relative Humidity
      description :
      2-m Maximum Daily Relative Humidity
      cell_methods :
      time:maximum
      standard_name :
      relative_humidity
      units :
      percent
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • mean_air_temperature
      (time)
      float32
      ...
      long_name :
      2-m Mean Daily Air Temperature
      description :
      2-m Mean Daily Air Temperature
      cell_methods :
      time:mean
      standard_name :
      air_temperature
      units :
      degC
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • duration_of_sunshine
      (time)
      timedelta64[ns]
      ...
      long_name :
      Duration of Sunshine
      description :
      Duration of Sunshine
      cell_methods :
      time:sum
      standard_name :
      duration_of_sunshine
      _ChunkSizes :
      5356
      [21421 values with dtype=timedelta64[ns]]
    • liquid_water_content_of_surface_snow
      (time)
      float32
      ...
      long_name :
      Liquid Snow Water Equivalent Depth on Surface
      description :
      Liquid Snow Water Equivalent Depth on Surface
      cell_methods :
      time:point
      standard_name :
      liquid_water_content_of_surface_snow
      units :
      kg m-2
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • maximum_1_minute_wind_speed
      (time)
      float32
      ...
      long_name :
      wind_speed
      description :
      wind_speed
      cell_methods :
      time:maximum
      standard_name :
      wind_speed
      units :
      m s-1
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • maximum_2_minute_wind_speed
      (time)
      float32
      ...
      long_name :
      wind_speed
      description :
      wind_speed
      cell_methods :
      time:maximum
      standard_name :
      wind_speed
      units :
      m s-1
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • wind_speed_of_gust
      (time)
      float32
      ...
      long_name :
      wind_speed_of_gust
      description :
      wind_speed_of_gust
      cell_methods :
      time:maximum
      standard_name :
      wind_speed_of_gust
      units :
      m s-1
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_fog
      (time)
      float32
      ...
      long_name :
      WX: Fog, Ice Fog, or Freezing Fog
      description :
      WX: Fog, Ice Fog, or Freezing Fog
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_heavy_fog
      (time)
      float32
      ...
      description :
      WX: Heavy Fog or Heavy Freezing Fog
      cell_methods :
      time:point
      long_name :
      WX: Heavy Fog or Heavy Freezing Fog
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_thunder
      (time)
      float32
      ...
      long_name :
      WX: Thunder
      description :
      WX: Thunder
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_ice_pellets
      (time)
      float32
      ...
      long_name :
      WX: Ice Pellets, Sleet, Snow Pellets, or Small Hail
      description :
      WX: Ice Pellets, Sleet, Snow Pellets, or Small Hail
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_hail
      (time)
      float32
      ...
      long_name :
      WX: Hail
      description :
      WX: Hail
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_rime
      (time)
      float32
      ...
      description :
      WX: Glaze or Rime
      cell_methods :
      time:point
      long_name :
      WX: Glaze or Rime
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_dist
      (time)
      float32
      ...
      long_name :
      WX: Dust, Volcanic Ash, Blowing Dust, Blowing Sand, or Blowing Obstruction
      description :
      WX: Dust, Volcanic Ash, Blowing Dust, Blowing Sand, or Blowing Obstruction
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_smoke
      (time)
      float32
      ...
      long_name :
      WX: Smoke or Haze
      cell_methods :
      time:point
      description :
      WX: Smoke or Haze
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_blowing_snow
      (time)
      float32
      ...
      long_name :
      WX: Blowing or Drifting Snow
      description :
      WX: Blowing or Drifting Snow
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_tornado
      (time)
      float32
      ...
      description :
      WX: Tornado, Waterspout, or Funnel Cloud
      cell_methods :
      time:point
      long_name :
      WX: Tornado, Waterspout, or Funnel Cloud
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_high_wind
      (time)
      float32
      ...
      long_name :
      WX: High or Damaging winds
      description :
      WX: High or Damaging winds
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_mist
      (time)
      float32
      ...
      long_name :
      WX: Mist
      description :
      WX: Mist
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_drizzle
      (time)
      float32
      ...
      long_name :
      WX: Drizzle
      description :
      WX: Drizzle
      cell_methods :
      time:point
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_freezing_drizzle
      (time)
      float32
      ...
      description :
      WX: Freezing Drizzle
      cell_methods :
      time:point
      long_name :
      WX: Freezing Drizzle
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_rain
      (time)
      float32
      ...
      description :
      WX: Rain
      cell_methods :
      time:point
      long_name :
      WX: Rain
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_freezing_rain
      (time)
      float32
      ...
      description :
      WX: Freezing Rain
      cell_methods :
      time:point
      long_name :
      WX: Freezing Rain
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_snow
      (time)
      float32
      ...
      description :
      WX: Snow, Snow pellets, Snow grains, or Ice Crystals
      cell_methods :
      time:point
      long_name :
      WX: Snow, Snow pellets, Snow grains, or Ice Crystals
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_unknown_precip
      (time)
      float32
      ...
      description :
      WX: Unknown Source of Precipitation
      cell_methods :
      time:point
      long_name :
      WX: Unknown Source of Precipitation
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_ground_fog
      (time)
      float32
      ...
      description :
      WX: Ground Fog
      cell_methods :
      time:point
      long_name :
      WX: Ground Fog
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • weather_event_ice_fog
      (time)
      float32
      ...
      cell_methods :
      time:point
      long_name :
      WX: Ice Fog or Freezing Fog
      description :
      WX: Ice Fog or Freezing Fog
      _ChunkSizes :
      5356
      [21421 values with dtype=float32]
    • station_name
      ()
      |S64
      ...
      long_name :
      station name
      description :
      station name
      [1 values with dtype=|S64]
    • time
      PandasIndex
      PandasIndex(DatetimeIndex(['1966-08-01 12:00:00', '1966-08-02 12:00:00',
                     '1966-08-03 12:00:00', '1966-08-04 12:00:00',
                     '1966-08-05 12:00:00', '1966-08-06 12:00:00',
                     '1966-08-07 12:00:00', '1966-08-08 12:00:00',
                     '1966-08-09 12:00:00', '1966-08-10 12:00:00',
                     ...
                     '2025-03-15 12:00:00', '2025-03-16 12:00:00',
                     '2025-03-17 12:00:00', '2025-03-18 12:00:00',
                     '2025-03-19 12:00:00', '2025-03-20 12:00:00',
                     '2025-03-21 12:00:00', '2025-03-22 12:00:00',
                     '2025-03-23 12:00:00', '2025-03-24 12:00:00'],
                    dtype='datetime64[ns]', name='time', length=21421, freq=None))
  • title :
    Global Historical Climatology Network daily (GHCNd)
    institution :
    NOAA Center for Environmental Prediction (NCEI)
    references :
    https://doi.org/10.1175/JTECH-D-11-00103.1
    url :
    https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily
    GHCN_Station_Code :
    GHCND:USW00094240
    Station_Name :
    QUILLAYUTE AIRPORT, WA US
    Station_Latitude :
    47.93695
    Station_Longitude :
    -124.55757
    Station_Elevation_in_Meters :
    56.4
    Conventions :
    CF-1.12
    featureType :
    timeSeries
    DODS_EXTRA.Unlimited_Dimension :
    time

Converting the xarray into pandas dataframe:

In [3]:
# Directly convert the xarray Dataset to a pandas DataFrame and save it as .csv file
df = ds.to_dataframe()
df.to_csv('ghcnd.csv', sep = ',')

Now we want to see whether the data came from one station or multiple station. In addition, we also want to see the features of the dataset and check them for missing values. To check the number of missing values, we will define a function that will calculate the number of instance of missing values for each feature, data start time, data end time, missing value start time etc. and finally output a dataframe containing all those informations.

In [4]:
print(f"Unique Stations are: {np.unique(df["station_name"])}\n",
      f"\nNumber of features in the dataset: {len(df.columns.to_list())}")
Unique Stations are: [b'QUILLAYUTE AIRPORT, WA US']
 
Number of features in the dataset: 45

Data Analysis and Visualization

For a time series dataset, it is very important to know the frequency of the dataset. In many cases, the data scientists need more data than what they have in hand. For such cases they use upsampling technique. For our case, We want to check the frequency of the data. In addition, we want to know whether there is any missing time step in the dataframe or the dataframe is continuous.

In [5]:
# Infer the expected frequency (e.g., 'D' for daily, 'H' for hourly)
frequency = pd.infer_freq(df.index)
print(f"Data Frequency: {frequency}")

# Create a full expected time range
start_time = df.index.min()              # DataFrame start time
end_time   = df.index.max()              # DataFrame end time
full_range = pd.date_range(start=start_time, end=end_time, freq=frequency)

# Find missing timestamps
missing_times = full_range.difference(df.index)
# Find missing timestamps
missing_times = full_range.difference(df.index)
print(f"Missing Timestamps: {len(missing_times)}")

# Check continuity programmatically
is_continuous = df.index.equals(full_range)
print(f"Is the time series continuous? {is_continuous}")
Data Frequency: D
Missing Timestamps: 0
Is the time series continuous? True
In [6]:
def Missing_Summary(dataframe):    
    missing_summary = []                            # Initialize a list to store results
    start_time = dataframe.index.min()              # DataFrame start time
    end_time   = dataframe.index.max()              # DataFrame end time
    
    # Iterate over each feature
    for column in dataframe.columns:
        if dataframe[column].isna().sum() == 0:     # Skip if the column has no missing values
            continue
        
        series = dataframe[column]                  # Extract the series for the current feature      
        missing_times = series[series.isna()].index # Find all times where the value is missing
        
        # Append to summary
        missing_summary.append({
            'Feature'           : column,
            'Start_Time'        : start_time,
            'End_Time'          : end_time, 
            'First_Missing_Time': missing_times.min(),
            'Last_Missing_Time' : missing_times.max(),
            'Missing_Count'     : len(missing_times)
                                })
    missing_summary_df = pd.DataFrame(missing_summary) # Convert to DataFrame

    return missing_summary_df
In [7]:
missing_summary_df = Missing_Summary(df)

missing_summary_df.head(3)
Out[7]:
Feature Start_Time End_Time First_Missing_Time Last_Missing_Time Missing_Count
0 precipitation_amount 1966-08-01 12:00:00 2025-03-24 12:00:00 1999-08-10 12:00:00 2023-02-12 12:00:00 10
1 thickness_of_snowfall_amount 1966-08-01 12:00:00 2025-03-24 12:00:00 1996-12-01 12:00:00 2025-03-24 12:00:00 10310
2 surface_snow_thickness 1966-08-01 12:00:00 2025-03-24 12:00:00 1996-12-01 12:00:00 2025-03-24 12:00:00 10339

Now that we know, there are some features which have missing values, we want to know their exact number and what are those features.

In [8]:
incomplete_features = missing_summary_df['Feature'].values.tolist()
print(f"Number of Features with missing values: {len(incomplete_features)}",
      f"\nList of Missing Features {incomplete_features}")
Number of Features with missing values: 20 
List of Missing Features ['precipitation_amount', 'thickness_of_snowfall_amount', 'surface_snow_thickness', 'maximum_air_temperature', 'minimum_air_temperature', 'dew_point_temperature', 'air_pressure_at_sea_level', 'surface_air_pressure', 'wet_bulb_temperature', 'mean_wind_speed', 'percent_of_possible_sunshine', 'mean_relative_humidity', 'min_relative_humidity', 'max_relative_humidity', 'mean_air_temperature', 'duration_of_sunshine', 'liquid_water_content_of_surface_snow', 'maximum_1_minute_wind_speed', 'maximum_2_minute_wind_speed', 'wind_speed_of_gust']

From the missing summary dataframe, we observe that we do not have any feature that starts from the beginning and ends at current time without any NaN field. Our goal was to predict weather event based on the features that are available, but due to missing values, we need to change our objective. From the observation, we see that, some features have value from 1966 to 1996, whereas some other features have values from 2006 to 2025. We created two groups of varibales based on the available data. Our first dataset will contain following features:

  • precipitation_amount
  • maximum_air_temperature
  • minimum_air_temperature
  • thickness_of_snowfall_amount
  • surface_snow_thickness

With this dataset (Period: 1966 to 1996), we will try to find the correlation among precipitation amount, average air temperature, thickness of snowfall amount and surface snow thickness. Our input sequene will have precipitation amount, average air temperature and our goal is to predict the thickness of snowfall amount and surface snow thickness.

Our second data segment (Period: 2006 to 2025) will contain following features:

  • precipitation_amount
  • dew_point_temperature
  • air_pressure_at_sea_level
  • surface_air_pressure
  • maximum_air_temperature
  • minimum_air_temperature

With this segment of dataset, we will try to predict average temperature based on precipitation, dew point temperature, and pressure values.

Snowfall Related Data Segment Analysis

We know the variables for snow fall related data segment, recording period of it.

In [9]:
var_snow = [     'precipitation_amount',
                'maximum_air_temperature',
                'minimum_air_temperature',
           'thickness_of_snowfall_amount',
                 'surface_snow_thickness']

dataset_1 = df[var_snow]
dataset_1.head(3)
Out[9]:
precipitation_amount maximum_air_temperature minimum_air_temperature thickness_of_snowfall_amount surface_snow_thickness
time
1966-08-01 12:00:00 0.0 21.1 7.8 0.0 0.0
1966-08-02 12:00:00 0.0 23.9 13.3 0.0 0.0
1966-08-03 12:00:00 0.0 20.0 12.8 0.0 0.0
In [10]:
start_date = '1966-08-01'
end_date = '1996-12-01'
# Filter the DataFrame
dataset_1= dataset_1.loc[start_date:end_date].copy()

# Display the result
dataset_1.head(3)
Out[10]:
precipitation_amount maximum_air_temperature minimum_air_temperature thickness_of_snowfall_amount surface_snow_thickness
time
1966-08-01 12:00:00 0.0 21.1 7.8 0.0 0.0
1966-08-02 12:00:00 0.0 23.9 13.3 0.0 0.0
1966-08-03 12:00:00 0.0 20.0 12.8 0.0 0.0
In [11]:
# Split data based on year threshold
train_end_time  = '1992-12-31'
test_start_time = '1993-01-01'

# Ensure the time column is datetime type (replace 'time' with your actual date column name)
dataset_1.reset_index(inplace=True)
dataset_1['time'] = pd.to_datetime(dataset_1['time'])
dataset_1['sample'] = 'test'  
dataset_1.loc[dataset_1['time'] <= train_end_time, 'sample'] = 'train'
dataset_1.head(5)
Out[11]:
time precipitation_amount maximum_air_temperature minimum_air_temperature thickness_of_snowfall_amount surface_snow_thickness sample
0 1966-08-01 12:00:00 0.0 21.100000 7.8 0.0 0.0 train
1 1966-08-02 12:00:00 0.0 23.900000 13.3 0.0 0.0 train
2 1966-08-03 12:00:00 0.0 20.000000 12.8 0.0 0.0 train
3 1966-08-04 12:00:00 0.0 17.800001 10.6 0.0 0.0 train
4 1966-08-05 12:00:00 0.0 21.100000 7.8 0.0 0.0 train

Now, we want to know how these variables and targets are changing with respect to time. I will use Plotly Express libary to plot the variables and targets with interactive range slider.

In [12]:
import plotly.express as px

def plot(data_frame, x_column, y_column, title, y_unit):
    """Plot time series with unit-aware y-axis and rangeslider"""
    fig = px.line(data_frame, x=x_column, y=y_column, title=title)
    fig.update_layout(
        yaxis_title=y_unit,
        xaxis_title='Time',
        xaxis_rangeslider_visible=True
    )
    fig.show()
    return fig  # Return the figure object instead of showing it

# Define variables, titles, and units
variables = [
    ('maximum_air_temperature', 'Maximum Air Temperature', '°C'),
    ('precipitation_amount', 'Precipitation', 'kg/m²'),
    ('thickness_of_snowfall_amount', 'Snowfall Thickness', 'm'),
    ('surface_snow_thickness', 'Surface Snow Depth', 'm')
]

# Generate and save plots with proper units
for col, title, unit in variables:
    fig = plot(dataset_1, 'time', col, title, unit)
    fig.write_html(f"{col}.html")  

Now we want to know how the targets are changing concerning the input change. First, we will create a new feature by taking the average of minimum and maximum air temperature. Then we will calculate the difference in average temperature ($\Delta T$) between the consecutive days and plot the amount of surface snow thickness and snowfall amount with this $\Delta T$.

In [13]:
dataset_1 = (
    dataset_1
    .assign(Avg_T=lambda x: (x['maximum_air_temperature'] + x['minimum_air_temperature']) / 2)
#    .drop(columns=['maximum_air_temperature', 'minimum_air_temperature'])
)
In [14]:
dataset_1_lag = dataset_1.shift(periods=-1)
dataset_1["Delta_T"] = dataset_1["Avg_T"] - dataset_1_lag["Avg_T"]
fig = px.scatter(x = dataset_1["Delta_T"], y = dataset_1["surface_snow_thickness"], title="Variation of Surface Snow Thickness w.r.t. ΔT")
fig.update_layout(
yaxis_title = "m",
xaxis_title = "∆T",
xaxis_rangeslider_visible=True
)
fig.write_html("SST.html")  
fig.show()
In [15]:
dataset_1_lag = dataset_1.shift(periods=-1)
fig = px.scatter(x = dataset_1["Delta_T"], y = dataset_1["thickness_of_snowfall_amount"], title="Variation of Snowfall Thickness w.r.t. ∆T")
fig.update_layout(
yaxis_title = "m",
xaxis_title ='∆T',
xaxis_rangeslider_visible=True)
fig.show()

From the above figures, we can say that the distribution of the targeted outputs concerning the change in temperature varies according to normal distribution.

Now we want to know what kind of distribution the $\Delta T$ has and how it's changing during no snow day and snow day. To learn about the variation of average temperature, let us plot the Kernel Density Estimate plot.

In [16]:
dataset_1_nosnow = dataset_1[dataset_1["thickness_of_snowfall_amount"]==0]

fig1 = ff.create_distplot(
    [dataset_1_nosnow["Delta_T"]], 
    ['No Snow'], 
    show_hist=True, 
    show_rug=True
)
fig1.add_vline(x=0, line_dash="dash", line_color="black")
fig1.update_layout(
yaxis_title = "Density",
xaxis_title ='∆T',
title = "KDE Plot of ∆T for No Snow day",
xaxis_rangeslider_visible=True
)
fig1.write_html("KDE_Delta_T.html")  
fig1.show()

From above KDE plot, we see that:

  • Distribution Center: Peaks near 0°C $\Delta T$
  • Density: ~0.05 to 0.2
  • Spread: -5°C to +5°C

This means:

  • Most temperature differences cluster near $\Delta T$ = 0 (highest density at center)
  • Moderate spread (±5°C temperature variation)
In [17]:
dataset_1_snow   = dataset_1[dataset_1["thickness_of_snowfall_amount"]>0]

fig2 = ff.create_distplot(
    [dataset_1_snow["Delta_T"]], 
    ['Snow'], 
    show_hist=True, 
    show_rug=True
)
fig2.add_vline(x=0, line_dash="dash", line_color="black")
fig2.update_layout(
yaxis_title = "Density",
xaxis_title ='∆T',
title = "KDE Plot of ∆T for Snow day",
xaxis_rangeslider_visible=True
)
fig2.write_html("KDE_∆T.html")  
fig2.show()

From snow day KDE plot, we observe that:

  • Distribution Center: Peaks near 0°C Delta_T
  • Density: ~0 to 0.18
  • Spread: -8°C to +6°C
  • Left Skew: Longer tail towards negative values

In addition, the 0°C reference line shows:

  • Majority of snow days have some temperature variation
  • Completely stable temperatures $\Delta T$ = 0 are rare

So it is evident that no snow days have:

  • Narrower distribution (taller peak at 0°C) than snow days and
  • Less negative skew

Now we want to quantify the distribution characteristics of $\Delta T$. To do so, we will use SciPy Stats library.

In [18]:
snow_stats = {
    "mean": dataset_1_snow["Delta_T"].mean(),
    "std": dataset_1_snow["Delta_T"].std(),
    "skew": dataset_1_snow["Delta_T"].skew(),
    "kurtosis": dataset_1_snow["Delta_T"].kurtosis()
}

print("Snow Day Statistics:")
for k,v in snow_stats.items():
    print(f"{k}: {v:.2f}")

# Compare with no-snow days
print("\nProbability of Delta_T < 0:") 
print("Snow days:", (dataset_1_snow["Delta_T"] < 0).mean())
print("No-snow days:", (dataset_1_nosnow["Delta_T"] < 0).mean())
Snow Day Statistics:
mean: -0.25
std: 2.58
skew: -0.51
kurtosis: 0.52

Probability of Delta_T < 0:
Snow days: 0.47440273037542663
No-snow days: 0.4676925929359414

From the above test, we see that the probability of a temperature drop ($\Delta T < 0$) is very similar between snow days and no-snow days. This suggests that temperature drops alone are not a strong predictor of snow occurrence in the dataset. Finally, to visualize the correlation among different features and observe the data distribution pattern, let us do a pair plot.

In [19]:
fig = px.scatter_matrix(dataset_1,
    dimensions=["precipitation_amount",
                "Avg_T",
                "thickness_of_snowfall_amount",
                "surface_snow_thickness",
                ],
    color="sample", symbol="sample",
    title="Scatter matrix of Snow Dataset",
    labels={col:col.replace('_', ' ') for col in dataset_1.columns})
fig.update_traces(diagonal_visible=False)
fig.update_layout(title="Pair Plot of Snow Related Atmospheric Dataset",
                  dragmode='select',
                  width=1000,
                  height=1000,
                  hovermode='closest')
fig.write_html(f"Pair_Plot.html")
fig.show()

Average Temperature Prediction Data Segment Analysis

In [20]:
temp_vars = [          'precipitation_amount',
                      'dew_point_temperature',
                  'air_pressure_at_sea_level',
                       'surface_air_pressure',
                    'maximum_air_temperature',
                    'minimum_air_temperature'
            ]
dataset_2 = df[temp_vars]
dataset_2.head(3)
Out[20]:
precipitation_amount dew_point_temperature air_pressure_at_sea_level surface_air_pressure maximum_air_temperature minimum_air_temperature
time
1966-08-01 12:00:00 0.0 NaN NaN NaN 21.1 7.8
1966-08-02 12:00:00 0.0 NaN NaN NaN 23.9 13.3
1966-08-03 12:00:00 0.0 NaN NaN NaN 20.0 12.8

This dataset's recording period starts from 2006 to current time.

In [21]:
start_date = '2006-01-01'
end_date = '2025-03-24'
dataset_2 = dataset_2.loc[start_date:end_date].copy()

Let us split the dataset into train and test sets. We chose to train the model with data from 2006 to 2022 and the rest of the data will be used as test data.

In [22]:
# Split data based on year threshold
train_end_time  = '2022-12-31'
test_start_time = '2023-01-01'

# Ensure the time column is datetime type (replace 'time' with your actual date column name)
dataset_2.reset_index(inplace=True)
dataset_2['time'] = pd.to_datetime(dataset_2['time'])
dataset_2['sample'] = 'test'  
dataset_2.loc[dataset_2['time'] <= train_end_time, 'sample'] = 'train'
dataset_2.head(3)
Out[22]:
time precipitation_amount dew_point_temperature air_pressure_at_sea_level surface_air_pressure maximum_air_temperature minimum_air_temperature sample
0 2006-01-01 12:00:00 16.500000 6.1 989.200012 981.700012 10.600000 6.7 train
1 2006-01-02 12:00:00 5.100000 4.4 1002.400024 994.900024 8.900001 2.2 train
2 2006-01-03 12:00:00 8.900001 3.9 1008.799988 1001.400024 8.300000 3.3 train

We are not going to plot how these variables are changing with respect to time, instead we will check how the precipitation amount changes with the change in surface air pressure.

In [23]:
dataset_2_lag = dataset_2.shift(periods=-1)
dataset_2["Delta_P"] = dataset_2["surface_air_pressure"] - dataset_2_lag["surface_air_pressure"]
In [24]:
dataset_2_prec   = dataset_2[dataset_2["precipitation_amount"]>0].dropna()

fig2 = ff.create_distplot(
    [dataset_2_prec["Delta_P"]], 
    ['Precipitation'],
    show_hist=True, 
    show_rug=True
)
fig2.add_vline(x=0, line_dash="dash", line_color="black")
fig2.update_layout(
yaxis_title = "Probability Density",
xaxis_title ='Delta P',
title = "KDE Plot of Delta P for Precipitation",
xaxis_rangeslider_visible=True
)
fig2.show()

From KDE plot, we observe that:

  • Unimodal Distribution
  • Distribution Center: Peaks near 0 $\Delta P$
  • Density: ~0 to 0.07
  • Spread: -20 inHg to +20 inHg

Let's plot the same KDE for Average Temperature Variation with repect to $\Delta P$.

In [25]:
dataset_2 = (
    dataset_2
    .assign(Avg_T=lambda x: (x['maximum_air_temperature'] + x['minimum_air_temperature']) / 2)
#    .drop(columns=['maximum_air_temperature', 'minimum_air_temperature'])
)

dataset_2_AvgT = dataset_2[dataset_2["Avg_T"]>0].dropna()

fig2 = ff.create_distplot(
    [dataset_2_AvgT["Delta_P"]], 
    ['Temperature'],
    show_hist=True,
    show_rug=True
)
fig2.add_vline(x=0, line_dash="dash", line_color="black")
fig2.update_layout(
yaxis_title = "Probability Density",
xaxis_title ='Delta P',
title = "KDE Plot of Delta P for Average Temperature",
xaxis_rangeslider_visible=True
)
fig2.show()

From KDE plot, we observe that:

  • Unimodal Distribution
  • Distribution Center: Peaks near 0 $\Delta P$
  • Density: ~0 to 0.087
  • Spread: -20 inHg to +20 inHg

Now, let us find the Pearson Correlation for this data segment and plot it.

In [26]:
correlation_matrix2 = dataset_2.drop(["sample",
                                      "time",
                                      "Avg_T",
                                      "minimum_air_temperature",
                                      "maximum_air_temperature",
                                      "Delta_P"],
                                      axis=1).corr(method='pearson')
print(correlation_matrix2)
                           precipitation_amount  dew_point_temperature  \
precipitation_amount                   1.000000               0.060049   
dew_point_temperature                  0.060049               1.000000   
air_pressure_at_sea_level             -0.353134              -0.099304   
surface_air_pressure                  -0.358946              -0.092882   

                           air_pressure_at_sea_level  surface_air_pressure  
precipitation_amount                       -0.353134             -0.358946  
dew_point_temperature                      -0.099304             -0.092882  
air_pressure_at_sea_level                   1.000000              0.996697  
surface_air_pressure                        0.996697              1.000000  
In [27]:
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix2, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Pearson Correlation Matrix for Data Segment 2")
plt.show()
No description has been provided for this image

From the Pearson Correlation, we see that following features are highly correlated:

  • Air Pressure at Sea Level and Surface Air Pressure.

So, we will consider only one of them in the input feature for machine learning purpose.

Data Processing

Now we want our data to be ready for feeding it to the Machine Learning Algorithms. So, we need to remove all the NaN for the dataset.

Data Segment 1

In [28]:
# Counting Null Values
print(f"Null Values in Data:\n\n{dataset_1.isnull().sum()}")

# Remove rows with ANY null values (default behavior)
dataset_1 = dataset_1.dropna()

print(f"Null Values in Data:\n\n{dataset_1.isnull().sum()}")
Null Values in Data:

time                            0
precipitation_amount            0
maximum_air_temperature         0
minimum_air_temperature         0
thickness_of_snowfall_amount    1
surface_snow_thickness          1
sample                          0
Avg_T                           0
Delta_T                         1
dtype: int64
Null Values in Data:

time                            0
precipitation_amount            0
maximum_air_temperature         0
minimum_air_temperature         0
thickness_of_snowfall_amount    0
surface_snow_thickness          0
sample                          0
Avg_T                           0
Delta_T                         0
dtype: int64

Data Segment 2

In [29]:
print(f"Null Values in Data:\n\n{dataset_2.isnull().sum()}")
dataset_2 = dataset_2.dropna()
print(f"Null Values in Data:\n\n{dataset_2.isnull().sum()}")
Null Values in Data:

time                           0
precipitation_amount           9
dew_point_temperature        333
air_pressure_at_sea_level    318
surface_air_pressure         318
maximum_air_temperature        9
minimum_air_temperature        5
sample                         0
Delta_P                      356
Avg_T                         10
dtype: int64
Null Values in Data:

time                         0
precipitation_amount         0
dew_point_temperature        0
air_pressure_at_sea_level    0
surface_air_pressure         0
maximum_air_temperature      0
minimum_air_temperature      0
sample                       0
Delta_P                      0
Avg_T                        0
dtype: int64

Now, we will make a list of inputs and outputs for both data segments.

In [30]:
# Define input features and target variables
inputs_1 = ['precipitation_amount', 'Avg_T']
targets_1 = ['thickness_of_snowfall_amount', 'surface_snow_thickness']

inputs_2 = [         'precipitation_amount',
                      'dew_point_temperature',
                       'surface_air_pressure'
            ]
targets_2 = ['Avg_T']

We split the dataset into train and test.

In [31]:
# Data Segment 1: Split into train/test based on 'sample' column
X_train_1 = dataset_1.loc[dataset_1['sample'] == 'train', inputs_1]
X_test_1 = dataset_1.loc[dataset_1['sample'] == 'test', inputs_1]

y_train_1 = dataset_1.loc[dataset_1['sample'] == 'train', targets_1]
y_test_1 = dataset_1.loc[dataset_1['sample'] == 'test', targets_1]

# Data Segment 2
X_train_2 = dataset_2.loc[dataset_2['sample'] == 'train', inputs_2]
X_test_2 = dataset_2.loc[dataset_2['sample'] == 'test', inputs_2]

y_train_2 = dataset_2.loc[dataset_2['sample'] == 'train', targets_2]
y_test_2 = dataset_2.loc[dataset_2['sample'] == 'test', targets_2]
In [32]:
print(X_train_1.head(3))
"\n"
print(y_train_1.head(3))
"\n \n"
print(X_train_2.head(3))
"\n"
print(y_train_2.head(3))
   precipitation_amount      Avg_T
0                   0.0  14.450001
1                   0.0  18.600000
2                   0.0  16.400000
   thickness_of_snowfall_amount  surface_snow_thickness
0                           0.0                     0.0
1                           0.0                     0.0
2                           0.0                     0.0
   precipitation_amount  dew_point_temperature  surface_air_pressure
0             16.500000                    6.1            981.700012
1              5.100000                    4.4            994.900024
2              8.900001                    3.9           1001.400024
      Avg_T
0  8.650001
1  5.550000
2  5.800000

Let us check the shapes of all train and test sets:

In [33]:
print(f"***********************************************************",
      f"\nData Segment 1",
      f"\n***********************************************************",
      f"\nShape of Input Train Data: {X_train_1.shape}",
      f"\nShape of Targets Train Data: {y_train_1.shape}",
      f"\nShape of Input Test Data: {X_test_1.shape}",
      f"\nShape of Target Test Data: {y_test_1.shape}\n"
      )

print(f"***********************************************************",
      f"\nData Segment 2",
      f"\n***********************************************************",
      f"\nShape of Input Train Data: {X_train_2.shape}",
      f"\nShape of Target Train Data: {y_train_2.shape}"
      f"\nShape of Input Test Data: {X_test_2.shape}",
      f"\nShape of Target Test Data: {y_test_2.shape}"
      )
*********************************************************** 
Data Segment 1 
*********************************************************** 
Shape of Input Train Data: (9649, 2) 
Shape of Targets Train Data: (9649, 2) 
Shape of Input Test Data: (1431, 2) 
Shape of Target Test Data: (1431, 2)

*********************************************************** 
Data Segment 2 
*********************************************************** 
Shape of Input Train Data: (6039, 3) 
Shape of Target Train Data: (6039, 1)
Shape of Input Test Data: (598, 3) 
Shape of Target Test Data: (598, 1)

ML Model Development and Learning

Regression refers to a predictive modeling problem that involves predicting a numerical value. ome regression machine learning algorithms support multiple outputs directly.

This includes most of the popular machine learning algorithms implemented in the scikit-learn library, such as:

  • Multi-output Regressor
  • K-Neighbors Regressor and
  • Decision Tree Regressor

We will feed our dataset into all these regressors, count their training time, mean squared errors and $R^2$ score for train and test set.

Importing ML Models:

In [34]:
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import time

# Initialize models
models_1 = {
    "MultiOutput_LinearRegression": MultiOutputRegressor(LinearRegression()),
    "DecisionTree": DecisionTreeRegressor(),
    "KNeighbors": KNeighborsRegressor()
}

We will write a loop to train our models with the data segments and store different performance metrics for each models. Now, let us initialize an empty dictionary to store Training Time, Mean Squared Errors, $R^2$ Score for train and test sets.

In [35]:
# Dictionary to store all MSE scores
model_stats_1 = {
    'Training_Time' : [],
    'Model': [],
    'Train_MSE': [],
    'Test_MSE': [],
    'train_R²_Score' : [],
    'test_R²_Score' : []
}

Training Data Segment 1 to Predict Snowfall Amount and Surface Snow Thickness

In [36]:
for model_name, model in models_1.items():
    start = time.time()
    # Train model
    model.fit(X_train_1, y_train_1)
    
    end = time.time()
    
    model_stats_1[f'Training_Time'] = end - start

    # Get predictions
    train_pred = model.predict(X_train_1)
    test_pred = model.predict(X_test_1)
    
    # Calculate MSE for each output
    train_mse = mean_squared_error(y_train_1,
                                   train_pred)
    test_mse = mean_squared_error(y_test_1,
                                  test_pred)

    model_stats_1[f'Train_MSE'].append(train_mse)
    model_stats_1[f'Test_MSE'].append(test_mse)

    train_r2 = model.score(X_train_1,
                           y_train_1)
    test_r2 = model.score(X_test_1,
                          y_test_1)
 
    model_stats_1[f'train_R²_Score'].append(train_r2)
    model_stats_1[f'test_R²_Score'].append(train_r2)

    model_stats_1['Model'].append(model_name)

Training Data Segment 2 for Predicting Average Temperature

In [37]:
# Initialize models
models_2 = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(),
    "KNeighbors": KNeighborsRegressor()
}

# Dictionary to store all MSE scores
model_stats_2 = {
    'Training_Time' : [],
    'Model': [],
    'Train_MSE': [],
    'Test_MSE': [],
    'train_R²_Score' : [],
    'test_R²_Score' : []
}
In [38]:
for model_name, model in models_2.items():
    start = time.time()
    # Train model
    model.fit(X_train_2, y_train_2)
    
    end = time.time()
    
    model_stats_2[f'Training_Time'] = end - start

    # Get predictions
    train_pred = model.predict(X_train_2)
    test_pred = model.predict(X_test_2)
    
    # Calculate MSE for each output
    train_mse = mean_squared_error(y_train_2,
                                   train_pred)
    test_mse = mean_squared_error(y_test_2,
                                  test_pred)

    model_stats_2[f'Train_MSE'].append(train_mse)
    model_stats_2[f'Test_MSE'].append(test_mse)

    train_r2 = model.score(X_train_2,
                           y_train_2)
    test_r2 = model.score(X_test_2,
                          y_test_2)
 
    model_stats_2[f'train_R²_Score'].append(train_r2)
    model_stats_2[f'test_R²_Score'].append(train_r2)

    model_stats_2['Model'].append(model_name)

Model Statistics

In [39]:
# Convert to DataFrame for better visualization
Model_Statistics_1 = pd.DataFrame(model_stats_1).set_index('Model')
Model_Statistics_2 = pd.DataFrame(model_stats_2).set_index('Model')

print(f"*****************************",
      f"\nStatistics for Data Segment 1",
      f"\n*****************************\n")
Model_Statistics_1.head()
***************************** 
Statistics for Data Segment 1 
*****************************

Out[39]:
Training_Time Train_MSE Test_MSE train_R²_Score test_R²_Score
Model
MultiOutput_LinearRegression 0.00518 0.000193 0.000040 0.044296 0.044296
DecisionTree 0.00518 0.000018 0.000080 0.933264 0.933264
KNeighbors 0.00518 0.000108 0.000036 0.519766 0.519766

From the above statistics, we see that all of the models take the same time to train on the dataset, and Decision Tree outperforms both Multi-Output Linear Regression and KNeighbors.

In [40]:
print(f"*****************************",
      f"\nStatistics for Data Segment 2",
      f"\n*****************************\n")
Model_Statistics_2.head()
***************************** 
Statistics for Data Segment 2 
*****************************

Out[40]:
Training_Time Train_MSE Test_MSE train_R²_Score test_R²_Score
Model
LinearRegression 0.005213 4.613514 5.032811 0.795707 0.795707
DecisionTree 0.005213 1.094333 5.568324 0.951541 0.951541
KNeighbors 0.005213 2.863456 4.624170 0.873202 0.873202

From the above statistics, we see that Multi-Output Linear Regression performs very poorly compared to Decision Tree and KNeighbors. Though Decision Tree provides the best training MSE, its performance is not that good in comparison with the other two on test data. But, Decision Tree provides the best $R^2$ score for both train and test data.

Conclusion

We aimed to analyze the Global Historical Climatology Network's daily data and try to learn the relations among different features. We divided the data into two segments and fed them to different ML models. For our first segment, we fed the ML with average temperature and precipitation amount to predict surface snow thickness and snowfall amount. Decision Tree model outperformed other models in this work. For Decision Tree model, train MSE: 0.000018, Test MSE: 0.000080, Train $R^2$: 0.933264   and  Test $R^2$: 0.933264. In the second data segment, our input features were precipitation amount, dew point temperature, and surface air pressure, and the target was average temperature. This time, KNeighbors outperformed other models with train MSE: 2.863456, Test MSE: 4.624170, Train $R^2$: 0.873202 and  Test $R^2$: 0.873202.